Stratified sampling biases models towards nonlinearity
Abstract
Motivation

Kallel (2005, 2007) has claimed that the time course of the change from a negative concord system to an NPI-based one in the history of English is better modeled by a quadratic function in logistic space than by a linear one. The use of logistic functions to model language change (and specifically syntactic change) was championed by Kroch (1989), and since the publication of that work they have become familiar. The form of the models used is generally homogeneous: they are linear in the time variable (reflecting the character of the modeled process of change) and include categorical predictors taken from the linguistic and social context of the tokens (reflecting Kroch’s thesis that “the rate of use of grammatical options in competition will generally differ across contexts” (1989)). If Kallel’s proposal is accepted, it constitutes a fundamental challenge for the theory of syntactic change. To wit, it raises the question of how parabolas (in logistic space) relate to language change; an empirically informed theory of language change must account for such a relation. However, we will demonstrate that Kallel’s result is an artifact of stratified sampling.

Predictive power

The models given by Kallel (2007) are replicated in Table 1, using her dataset without binning. It is suspicious that the value of the coefficient of the year term is so small, but the χ² test indicates that the reduction in deviance yielded by the quadratic model is significant (p = 8.119 × 10⁻⁶). This is Kallel’s result (Kallel 2007, (19)). Having derived these models, it is possible to ask what, if any, additional predictive power the quadratic one affords. To make such a determination, a cross-validation paradigm will be employed. The dataset is split into two pieces, one containing 95% of the tokens and the other containing the remaining 5%. Linear and quadratic models will be fit to the larger subset of the data.
The parameter values thus obtained will then be used to predict the outcomes in the other subset of the data, given the model inputs (year and coordinate context) associated with each token. The results from this trial are given in Table 2. The improvement given by the quadratic model over the linear one is marginal. Both models perform markedly better than either of two baselines, one of which predicts a random number in the interval (0,1) as the probability of “any” (regardless of year or coordinate status) and the other of which always predicts 0.5.

Simulation

In order to further probe the properties of the two models proposed by Kallel and discussed here, we will construct a simulation procedure to generate a dataset similar in structure to Kallel’s corpus (consisting of 944 tokens). Bootstrap resampling from the corpus tells us that Kallel’s model is not simply overfit to the data, i.e. the observed drop in model deviance is robust provided that the sample is a good representation of the underlying population. It is precisely that proviso that is at issue, and the dangers of ignoring the high degree of correlation between linguistic tokens produced by a single speaker have been pointed out by Johnson (2009) and Gorman (2009) in the context of apparent-time sociolinguistic studies (the latter making reference to the negative concord variation in non-Standard Present Day English). It is this variation that we will attempt to capture in our simulation.

The base of the simulation will be a generating model, either a linear or a quadratic logistic function, the parameters of which will be taken from the corresponding model fit to Kallel’s corpus. We will take the output of this function to be the “average” underlying rate of “any” use. We will then assume that the variation introduced into a corpus by non-random sampling (e.g. sampling more than once from the same author) can be modeled by a noise term added to the output of the generating model; this noise term will be normally distributed in logit space.

It is then necessary to determine the standard deviation of the noise term. Table 3 gives the average model deviances (a measure of goodness of fit) of a logit-linear model fit to 2000 simulated datasets, as the standard deviation of the noise is varied in increments of 0.05. The closest match to the linear model fit to Kallel’s actual data (deviance = 725.9) is provided by the value 0.55. On the other hand, Kallel’s logit-quadratic model (deviance = 699) is most closely approximated by values of the noise standard deviation very close to 0. The variation expected to inhere in the data based on the sampling procedure is not found: it has been absorbed by the quadratic term in the model.

Having determined the value for the noise parameter, it is possible to use the simulation experiment to test our confidence in the χ² test. Specifically, we can count the number of simulated logit-linear datasets in which the test gives a significant p-value. If the proportion of such datasets is greater than the assumed Type I error rate (0.05), it will be evident that the non-independence in the data has invalidated the test procedure. Indeed, with σ_noise = 0.55, we obtain a p-value less than 0.05 in 10.6% of 2000 trials. This indicates that the significance of Kallel’s result is illusory: in data known to be generated from a logit-linear model with levels of noise comparable to those in the data, the χ² test does not provide a reliable indicator of statistical significance.

Future work

The question raised by Kallel (2007) about the role of logit-linear models in the understanding of syntactic change remains unanswered. Traditional methods of data collection have been demonstrated to be inadequate for providing a definitive answer.
Truly random sampling from the population of speakers should be used to avoid spurious results. The availability of large-scale corpora (Google Books, Project Gutenberg) makes such an undertaking feasible. If the problem of automatically extracting syntactic information from such sources can be addressed, the linearity hypothesis can be put to a fair test.
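
The simulation logic described in the abstract can be illustrated with a minimal sketch. It is not the author's code, and several details are assumptions made only for illustration: the noise is drawn once per text and shared by all tokens of that text, each text has a single date, the corpus is laid out as 59 texts of 16 tokens (which happens to total 944), the generating coefficients b0 and b1 are hypothetical rather than taken from Kallel's fit, and the coordinate-context predictor is omitted. Data are generated from a logit-linear model, linear and quadratic logistic models are fit, and the sketch counts how often the likelihood-ratio χ² test (1 degree of freedom) rejects at 0.05.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(a, b):
    # Gauss-Jordan elimination with partial pivoting (small dense systems).
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_deviance(rows, degree, iters=50):
    """Fit a logistic polynomial of the given degree in t by Newton's method.

    rows holds one (t, successes, tokens) triple per text.
    Returns the deviance, i.e. -2 * log-likelihood at the fitted parameters.
    """
    p = degree + 1
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for t, k, m in rows:
            x = [t ** j for j in range(p)]
            mu = sigmoid(sum(b * v for b, v in zip(beta, x)))
            for i in range(p):
                grad[i] += (k - m * mu) * x[i]
                for j in range(p):
                    hess[i][j] += m * mu * (1.0 - mu) * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-9:
            break
    loglik = 0.0
    for t, k, m in rows:
        mu = sigmoid(sum(b * t ** j for j, b in enumerate(beta)))
        mu = min(max(mu, 1e-12), 1.0 - 1e-12)
        loglik += k * math.log(mu) + (m - k) * math.log(1.0 - mu)
    return -2.0 * loglik


def simulate(rng, sigma, n_texts=59, tokens_per_text=16, b0=0.0, b1=2.0):
    # Hypothetical corpus layout and coefficients (assumptions, see lead-in).
    rows = []
    for _ in range(n_texts):
        t = rng.uniform(-1.0, 1.0)                 # rescaled date of the text
        eta = b0 + b1 * t + rng.gauss(0.0, sigma)  # logit-space noise per text
        pr = sigmoid(eta)
        k = sum(rng.random() < pr for _ in range(tokens_per_text))
        rows.append((t, k, tokens_per_text))
    return rows


def false_positive_rate(sigma, trials=2000, seed=0):
    # Share of logit-linear datasets where the quadratic term tests significant.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        rows = simulate(rng, sigma)
        lr = fit_deviance(rows, 1) - fit_deviance(rows, 2)
        # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
        p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
        hits += p_value < 0.05
    return hits / trials
```

Calling `false_positive_rate(0.0)` and `false_positive_rate(0.55)` and comparing the two rates shows how much the shared noise inflates the rejection rate under these assumptions. The numbers will not reproduce the 10.6% reported above, since the corpus layout and coefficients here are invented for the sketch.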